ggplot2 is an R package dedicated to data visualization. It can
greatly improve the quality and aesthetics of your graphics, and
will make you much more efficient in creating them.
The author of the package, Hadley Wickham, received the COPSS
Presidents' Award, the most prestigious award for young statisticians,
in 2019. Hadley Wickham also created many other popular packages,
several of which are bundled with ggplot2 in the tidyverse.
# load the ggplot2 package in R
library(ggplot2)
ggplot2 builds charts through layers using
geom_ functions, including geom_point,
geom_line, geom_bar,
geom_boxplot, geom_smooth,
geom_tile, geom_violin,
geom_hline, geom_vline,
geom_histogram, geom_sf,
geom_contour, geom_density,
geom_hex, geom_jitter, geom_map,
geom_area,
geom_path, geom_segment, geom_qq,
…
# View a summary of the data "mpg"
mpg
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # ℹ 224 more rows
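In order to create a plot, you first call the ggplot() function on your data, which creates a
blank canvas. You then map the variables of interest to aesthetics with aes() and add a geom layer on top:
# create canvas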
ggplot(data = mpg)
# map variables of interest
ggplot(mpg, aes(x = displ, y = hwy))
# add geom object
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()
When we add the geom layer we use the addition (+) operator. As you add new layers you will always use + to add onto your visualization.
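Because each + returns a new plot object, a plot can be stored in a variable and extended step by step. The sketch below assumes only ggplot2 and the mpg data; the variable names base, scatter and scatter_smooth are purely illustrative.

```r
library(ggplot2)

# store the canvas together with its mappings; it has no layers yet
base <- ggplot(mpg, aes(x = displ, y = hwy))

# adding a geom returns a new plot object with one layer
scatter <- base + geom_point()

# adding a second geom stacks another layer on top
scatter_smooth <- scatter + geom_smooth(se = FALSE)

# count the layers held by each plot object
length(base$layers)
length(scatter_smooth$layers)

# printing the object draws the plot
scatter_smooth
```

Storing intermediate plots like this makes it easy to reuse the same canvas with different geoms.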
The aesthetic mappings take properties of the data and use them to influence visual characteristics, such as position, colour, size, shape, or transparency. Each visual characteristic can thus encode an aspect of the data and be used to convey information.
All aesthetics for a plot are specified in the aes()
function call (later in this tutorial you will see that each geom layer
can have its own aes specification).
# add a mapping from the class of the cars to a colour characteristic
ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
geom_point()
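Several visual characteristics can be mapped at once. The following is a minimal sketch, with the choice of variables (class, drv, cyl) made only for illustration; it may well be too busy for a real figure.

```r
library(ggplot2)

# map three variables to three different visual characteristics:
# class -> colour, drv -> shape, cyl -> size
p <- ggplot(mpg, aes(x = displ, y = hwy,
                     colour = class, shape = drv, size = cyl)) +
  geom_point(alpha = 0.5)  # semi-transparent points reduce overplotting

# list the aesthetics that are mapped from the data
names(p$mapping)

p
```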
Note that ‘aesthetics’ in ggplot refers to WHAT is plotted, not HOW
it is plotted, unlike the word’s usual meaning. Using the
aes() function will cause the visual to be based on the
data specified in the argument. For example, using
aes(colour = "blue") won’t cause the geometry’s colour to
be ‘blue’, but will instead cause the visual to be mapped from the
vector c("blue") — as if we only had a single class of car
that happened to be called ‘blue’. If you wish to apply an aesthetic
property to an entire geometry, you can set that property as an argument
to the geom function, OUTSIDE of the aes() call:
# illustrate the common mistake of trying to specify a colour within the aes() function
ggplot(mpg, aes(x = displ, y = hwy, colour = "blue")) +
geom_point()
# here is the correct way
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(colour = "blue")
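One way to see the difference between mapping and setting a colour is to inspect the values ggplot2 computes for each layer. This sketch uses layer_data() from ggplot2; the names p_mapped and p_set are illustrative.

```r
library(ggplot2)

# colour mapped inside aes(): "blue" is treated as a data value
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy, colour = "blue")) +
  geom_point()

# colour set outside aes(): "blue" is used as the literal colour
p_set <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(colour = "blue")

# layer_data() returns the computed values that will be drawn for a layer
unique(layer_data(p_mapped)$colour)  # a colour chosen by the default discrete scale
unique(layer_data(p_set)$colour)     # the colour we asked for
```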
# YOUR CODE HERE
ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point(colour = "red")
# YOUR CODE HERE
ggplot(mpg, aes(x = cty, y = hwy, colour = cyl)) +
geom_point()
Building on these basics, ggplot2 can be used to build almost any kind of plot you may want. These plots are declared using functions.
The most obvious distinction between plots is what geometric objects (geoms) they include. ggplot2 supports a number of different types of geoms, including:
geom_point(), geom_line(), geom_smooth(), geom_bar(), geom_boxplot(), geom_histogram(), geom_polygon(), geom_map()
Each of these geometries will leverage the aesthetic mappings supplied. For example, you can map data to the location, colour, and shape of a geom_point (e.g., points can be circles or squares), or you can map data to the linetype of a geom_line (e.g., solid or dotted).
Most geoms require an x and y mapping as a bare minimum.
# x and y mapping needed for geom_point and geom_smooth
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
# no y mapping needed for geom_bar and geom_histogram
ggplot(data = mpg, aes(x = class)) +
geom_bar()
ggplot(data = mpg, aes(x = hwy)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# YOUR CODE HERE
ggplot(mpg, aes(x = drv, y = displ)) +
geom_boxplot()
# YOUR CODE HERE
ggplot(mpg, aes(x = cyl, y = displ)) +
geom_boxplot()
ggplot(mpg, aes(x = cyl, y = displ, group = cyl)) +
geom_boxplot()
ggplot(mpg, aes(x = factor(cyl), y = displ)) +
geom_boxplot()
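The three boxplot calls above differ only in how cyl is treated. Since cyl is stored as an integer, ggplot2 treats it as continuous and, without a grouping, draws a single box (with a warning). Adding group = cyl or converting to a factor gives one box per value. A small sketch that counts the boxes, using layer_data() from ggplot2 and illustrative variable names:

```r
library(ggplot2)

# cyl is an integer column, so it is handled as a continuous variable
class(mpg$cyl)

p_group  <- ggplot(mpg, aes(x = cyl, y = displ, group = cyl)) +
  geom_boxplot()
p_factor <- ggplot(mpg, aes(x = factor(cyl), y = displ)) +
  geom_boxplot()

# layer_data() holds one row of computed statistics per box
nrow(layer_data(p_group))
nrow(layer_data(p_factor))
```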
What makes this approach really powerful is that you can add multiple geometries to a plot, allowing you to create complex graphics showing multiple aspects of your data.
# plot with both points and smoothed line
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Of course, the aesthetics for each geom can be different, so you could show multiple lines on the same plot, each with different colours, styles, etc. It’s also possible to give each geom a different data argument, so that you can show multiple data sets in the same plot.
For example, we can plot both points and a smoothed line for the same x and y variable but specify unique colours within each geom:
# same as above, but points blue and line red
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(colour = "blue") +
geom_smooth(colour = "red")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
So if we specify an aesthetic within ggplot(), it will be passed on to each geom that follows. Alternatively, we can specify certain aesthetics within an individual geom, so that they apply only to that layer (e.g. geom_point).
# colour aesthetic passed to each geom layer
ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
geom_point() +
geom_smooth(se = FALSE)
# colour aesthetic specified for only the geom_point layer
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(colour = class)) +
geom_smooth(se = FALSE)
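Each geom can also receive its own data argument. The following is a minimal sketch that highlights a subset of the cars on top of the full data; the subset two_seaters and the colours are chosen only for illustration.

```r
library(ggplot2)

# illustrative subset: the two-seater cars only
two_seaters <- subset(mpg, class == "2seater")

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  # first layer: all cars, using the data given to ggplot()
  geom_point(colour = "grey60") +
  # second layer: its own data argument, so only the subset is drawn
  geom_point(data = two_seaters, colour = "red", size = 3)

p
```

The second layer inherits the x and y mappings from ggplot(), so the subset only needs to contain the same column names.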
# YOUR CODE HERE
ggplot(mpg, aes(x = factor(cyl), y = displ)) +
geom_boxplot() +
geom_point()
In addition to a default statistical transformation, each geom also has a default position adjustment which specifies how different components should be positioned relative to each other. This position is noticeable in a geom_bar if you map a different variable to the fill colour of the bars:
# bar chart of class, coloured by drive (front, rear, 4-wheel)
ggplot(mpg, aes(x = class, fill = drv)) +
geom_bar()
The geom_bar by default uses a position adjustment of ‘stack’, which makes each rectangle’s height proportional to its value and stacks them on top of each other. We can use the position argument to specify what position adjustment rules to follow:
# position = "dodge": values next to each other
ggplot(mpg, aes(x = class, fill = drv)) +
geom_bar(position = "dodge")
# position = "fill": percentage chart
ggplot(mpg, aes(x = class, fill = drv)) +
geom_bar(position = "fill")
Check the documentation for each particular geom to learn more about its positioning adjustments.
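Position adjustments can also be passed as function calls, which gives access to extra options. The sketch below uses position_dodge2() from ggplot2; the setting preserve = "single" is just one illustrative choice.

```r
library(ggplot2)

# preserve = "single" keeps every bar the same width,
# even when a class has no cars for one of the drive types
p <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = position_dodge2(preserve = "single"))

p
```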
Set position = "jitter" in the point geometry layer. What
happens?
# YOUR CODE HERE
ggplot(mpg, aes(x = factor(cyl), y = displ)) +
geom_boxplot() +
geom_point(position = "jitter")
ggplot(mpg, aes(x = factor(cyl), y = displ)) +
geom_boxplot() +
geom_jitter(width = 0.2)
# YOUR CODE HERE
ggplot(mpg, aes(x = hwy, fill = drv)) +
geom_histogram(position = "dodge")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(mpg, aes(x = hwy, fill = drv)) +
geom_histogram(position = "fill")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 15 rows containing missing values (`geom_bar()`).
ggplot(mpg, aes(x = hwy, fill = drv)) +
geom_histogram(position = "stack")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Whenever you specify an aesthetic mapping, ggplot uses a
particular scale to determine the range of values that the data should
map to. Thus when you specify
# colour the data by class of car
ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
geom_point()
ggplot automatically adds a scale for each mapping to the plot:
# same as above, with explicit scales
ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
geom_point() +
scale_x_continuous() +
scale_y_continuous() +
scale_colour_discrete()
Each scale can be represented by a function with the following name:
scale_, followed by the name of the aesthetic property,
followed by an _ and the name of the scale. A continuous
scale will handle things like numeric data (where there is a continuous
set of numbers), whereas a discrete scale will handle
things like colours (since there is a small list of distinct
colours).
While the default scales will work fine, it is possible to explicitly add different scales to replace the defaults. For example, you can use a scale to change the direction of an axis:
# mileage relationship, ordered in reverse
ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point() +
scale_x_reverse() +
scale_y_reverse()
Similarly, you can use scale_x_log10() and
scale_x_sqrt() to transform your scale. You can also use
scales to format your axes:
ggplot(mpg, aes(x = class, fill = drv)) +
geom_bar(position = "fill") +
scale_y_continuous(breaks = seq(0, 1, by = .2), labels = scales::percent)
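As a small sketch of the axis transformations mentioned above, the plot below puts the x axis on a log10 scale and the y axis on a square-root scale. The choice of displ and hwy is only for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_x_log10() +  # x positions are log10-transformed, labels stay in litres
  scale_y_sqrt()     # y positions are square-root-transformed

p
```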
A common parameter to change is which set of colours to use in a
plot. While you can use the default colouring, a more common option is
to leverage the pre-defined palettes from colourbrewer.org. These colour
sets have been carefully designed to look good and, in many cases, to
remain readable for people with certain forms of colour blindness. We can
leverage colour brewer palettes by specifying the scale_colour_brewer()
function, passing the palette as an argument.
# default colour brewer
ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
geom_point() +
scale_colour_brewer()
# specifying colour palette
ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
geom_point() +
scale_colour_brewer(palette = "Set3")
Note that you can get the palette name from the colourbrewer website by looking at the scheme query parameter in the URL.
You can also specify continuous colour values by using a gradient scale, or manually specify the colours you want to use as a named vector.
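A minimal sketch of both options follows, using scale_colour_gradient() and scale_colour_manual() from ggplot2. The particular colours and the name drv_colours are arbitrary choices for illustration.

```r
library(ggplot2)

# continuous variable: a gradient between two colours
p_gradient <- ggplot(mpg, aes(x = displ, y = hwy, colour = cty)) +
  geom_point() +
  scale_colour_gradient(low = "yellow", high = "darkblue")

# discrete variable: colours picked by hand, named by data value
drv_colours <- c("4" = "darkorange", "f" = "steelblue", "r" = "forestgreen")
p_manual <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  scale_colour_manual(values = drv_colours)

p_gradient
p_manual
```

Naming the vector by data value means the colours stay attached to the right category even if the order of the levels changes.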
Plot displ on the x axis and
hwy on the y axis, and colour by drv. Next, go
to colourbrewer.org and find an
appropriate colour palette that is colourblind safe. Update your plot
with that palette.
# YOUR CODE HERE
ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
geom_point() +
scale_colour_brewer(palette = "Dark2")
Using scale_y_continuous(), try increasing the
number of breaks so that there are breaks at 15, 20, 25, 30, 35 and so
on. Hint: Check the help file using
?scale_y_continuous.
# YOUR CODE HERE
ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
geom_point() +
scale_colour_brewer(palette = "Dark2") +
scale_y_continuous(n.breaks = 6)
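The n.breaks argument only guides the break algorithm, which may pick a slightly different number of breaks to keep the labels tidy. If exact positions matter, the breaks argument can be given explicitly. A sketch, assuming breaks every 5 mpg from 15 to 45 are wanted (hwy_breaks is an illustrative name):

```r
library(ggplot2)

# explicit break positions: 15, 20, 25, ..., 45
hwy_breaks <- seq(15, 45, by = 5)

p <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  scale_colour_brewer(palette = "Dark2") +
  scale_y_continuous(breaks = hwy_breaks)

p
```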
Facets are ways of splitting a plot into multiple different
pieces (subplots). This allows you to view a separate plot for each
value of a categorical variable. You can construct a plot with multiple
facets by using the facet_wrap() function. This will
produce a “row” of subplots, one for each value of the categorical
variable (the number of rows can be specified with an additional argument):
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
facet_wrap(. ~ class)
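The number of rows is controlled with the nrow argument of facet_wrap(). A minimal sketch, with nrow = 2 chosen only for illustration:

```r
library(ggplot2)

# wrap the per-class subplots into two rows
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, nrow = 2)

p
```

There is also an ncol argument if fixing the number of columns is more convenient.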
You can also use facet_grid() to facet your data by more
than one categorical variable. Note that we use a tilde (~)
in our facet functions. With facet_grid() the
variable to the left of the tilde will be represented in the rows and
the variable to the right will be represented across the columns.
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
facet_grid(year ~ cyl)
Plot displ
on the x axis and hwy on the y axis. Facet the plot by
drv using facet_grid().
# YOUR CODE HERE
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth() +
facet_grid(drv ~ .)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Now set scales='free' inside
facet_grid(). What has changed?
# YOUR CODE HERE
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth() +
facet_grid(drv ~ ., scales = 'free')
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Textual labels and annotations (on the plot, axes, geometry, and
legend) are an important part of making a plot understandable and
communicating information. Although not an explicit part of the Grammar
of Graphics (they would be considered a form of geometry),
ggplot makes it easy to add such annotations.
You can add titles and axis labels to a chart using the
labs() function (not labels, which is a
different R function!):
ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
geom_point() +
labs(title = "Fuel Efficiency by Engine Power",
subtitle = "Fuel economy data from 1999 and 2008 for 38 popular models of cars",
x = "Engine power (litres displacement)",
y = "Fuel Efficiency (miles per gallon)",
colour = "Car Type")
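Besides titles and axis labels, text can be placed on the plot itself. The sketch below uses annotate() from ggplot2 to add a single label at fixed coordinates; the position and the label text are placeholders chosen for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  # annotate() builds a one-off layer from constants, not from the data frame
  annotate("text", x = 6, y = 40, label = "example annotation")

p
```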
p <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
geom_point() +
labs(title = "Fuel Efficiency by Engine Power",
subtitle = "Fuel economy data from 1999 and 2008 for 38 popular models of cars",
x = "Engine power (litres displacement)",
y = "Fuel Efficiency (miles per gallon)",
colour = "Car Type")
# open a PDF graphics device, draw the plot into it, then close the device
pdf("myplot.pdf", width = 6, height = 4)
print(p)
dev.off()
## quartz_off_screen
## 2
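ggplot2 also provides ggsave(), which opens and closes the graphics device for you. A minimal sketch, writing to a temporary file so that nothing is left in the working directory (out_file is an illustrative name):

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point()

# temporary location used only for this sketch
out_file <- file.path(tempdir(), "myplot.pdf")

# the file type is taken from the extension; width and height are in inches by default
ggsave(out_file, plot = p, width = 6, height = 4)

file.exists(out_file)
```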
A great website that showcases most, if not all, of what you
can do with ggplot2: https://r-graph-gallery.com/.